The violent crime rate in the U.S. increased by 3.4 percent nationwide in 2016. As international students, as well as New Yorkers, we are always concerned about public safety in NYC, especially after the recent terrorist attack near the World Trade Center. Thus, our group decided to investigate the crime data more deeply and look for underlying reasons behind the increase in the crime rate.
The NYPD official website provides citywide historic crime data as Excel files. We downloaded these datasets and merged them into nyc_crime_hist. The resulting data frame contains the total number of offenses per year from 2000 to 2016, along with the major offense category (felony, misdemeanor, or violation) and a detailed description of each offense.
# packages used throughout the analysis
library(tidyverse)
library(readxl)
library(janitor)
library(lubridate)
library(plotly)
# read each historic table and label it with its offense category
nyc_hist_vio = read_excel("./historic/violation-offenses-2000-2016.xls", range = "A4:R6") %>%
  mutate(ofns_type = "VIOLATION")
nyc_hist_felony_7 = read_excel("./historic/seven-major-felony-offenses-2000-2016.xls", range = "A5:R12") %>%
  mutate(ofns_type = "FELONY")
nyc_hist_felony = read_excel("./historic/non-seven-major-felony-offenses-2000-2016.xls", range = "A5:R13") %>%
  mutate(ofns_type = "FELONY")
nyc_hist_mis = read_excel("./historic/misdemeanor-offenses-2000-2016.xls", range = "A4:R21") %>%
  mutate(ofns_type = "MISDEMEANOR")
# combine the four tables into one data frame
nyc_crime_hist = nyc_hist_mis %>%
  full_join(nyc_hist_felony) %>%
  full_join(nyc_hist_felony_7) %>%
  full_join(nyc_hist_vio) %>%
  mutate(ofns_type = as.factor(ofns_type), ofns_desc = OFFENSE) %>%
  select(-OFFENSE)
We focus our efforts on the data for the current year, 2017, obtained from NYC OpenData. It includes all valid felony, misdemeanor, and violation crimes reported to the NYPD up to October of this year. The dataset was last updated on October 25, 2017.
nyc_crime_2017 = read_csv("./NYPD_Complaint_Data_Current_YTD.csv") %>%
  clean_names()
nyc_crime_2017 = nyc_crime_2017 %>%
  mutate(cmplnt_fr_dt = as.Date(cmplnt_fr_dt, "%m/%d/%Y")) %>%
  select(cmplnt_fr_dt, cmplnt_fr_tm, ky_cd, ofns_desc, law_cat_cd, boro_nm, prem_typ_desc, longitude, latitude) %>%
  filter(year(cmplnt_fr_dt) == 2017) %>%
  rename(date = cmplnt_fr_dt, time = cmplnt_fr_tm, prem_typ = prem_typ_desc, ofns_type = law_cat_cd, boro = boro_nm)
Since the dataset has 341716 rows and 9 columns, we randomly sample 50000 observations and create an interactive map showing where the crimes in New York City occurred:
# draw 50000 rows without replacement; a distinct name avoids masking base::sample()
crime_sample = nyc_crime_2017 %>%
  sample_n(50000, replace = FALSE)
crime_sample %>%
  mutate(text_label = str_c("Offense desc: ", ofns_desc, " Boro: ", boro)) %>%
  plot_ly(x = ~longitude, y = ~latitude, type = "scatter", mode = "markers",
          alpha = 0.5,
          color = ~ofns_type,
          text = ~text_label)
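The draw above changes on every run because no seed is set. A minimal sketch of how the sampling step could be made repeatable, using a small made-up data frame in place of nyc_crime_2017 (the helper `sample_rows`, the seed value 2017, and the toy columns are assumptions for illustration, not part of the original analysis):

```r
# Toy stand-in for nyc_crime_2017; the real data has 341716 rows.
toy_crime <- data.frame(
  id = 1:1000,
  boro = rep(c("BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"), 200)
)

# Hypothetical helper: draw n rows without replacement, reproducibly.
sample_rows <- function(df, n, seed = 2017) {
  set.seed(seed)  # fixing the seed makes the draw repeatable
  df[sample.int(nrow(df), n), , drop = FALSE]
}

first_draw  <- sample_rows(toy_crime, 100)
second_draw <- sample_rows(toy_crime, 100)

# Same seed, same rows.
identical(first_draw, second_draw)
```

With a fixed seed, the interactive map would show the same 50000 points each time the report is knit.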
The historic data shows the trend in total crime numbers over the years. The total number of crimes has been decreasing since 2000. The number of misdemeanors increased after 2005 and dropped again after 2010.
nyc_crime_hist = nyc_crime_hist %>%
  gather(key = year, value = count, "2000":"2016") %>%
  group_by(year, ofns_type) %>%
  # monthly average for each full year
  summarize(n = sum(count)/12) %>%
  full_join(nyc_crime_2017 %>%
              group_by(ofns_type) %>%
              # 2017 is a partial year, so divide by 9 instead of 12
              summarize(n = n()/9) %>%
              mutate(year = "2017")) %>%
  ungroup()
nyc_crime_hist %>%
  mutate(year = as.numeric(year)) %>%
  ggplot(aes(x = year, y = n, fill = ofns_type)) + geom_bar(stat = "identity")
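Because 2017 is a partial year, the code above compares monthly averages rather than yearly totals: each full year is divided by 12 and 2017 by 9. A small sketch of that normalization with made-up totals (both totals are hypothetical and only illustrate the arithmetic):

```r
# Hypothetical totals: a full year versus a partial year.
total_2016  <- 480000  # made-up 12-month total
total_2017  <- 342000  # made-up total for the months covered so far
months_2017 <- 9       # divisor used in the code above

# Put both years on a per-month scale so they can share one axis.
monthly_2016 <- total_2016 / 12
monthly_2017 <- total_2017 / months_2017

c(`2016` = monthly_2016, `2017` = monthly_2017)
```

Without this step, the 2017 bar would look artificially short next to the full years.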
For 2017, we plot a bar chart showing the number of crimes and the offense types in each borough:
barplot = nyc_crime_2017 %>%
  mutate(boro = fct_infreq(boro)) %>%
  ggplot(aes(x = boro, fill = ofns_type)) + geom_bar()
ggplotly(barplot)
And also over the months of the year, counted by date:
nyc_crime_2017 %>%
  group_by(date, ofns_type) %>%
  summarize(crime_count = n()) %>%
  ggplot(aes(x = date, y = crime_count, color = ofns_type)) +
  geom_point(alpha = .6) + geom_smooth() +
  theme(legend.position = "bottom")
The first graph shows the number of crimes in each borough. Among the 5 boroughs, Brooklyn has the highest number of crimes in the first 10 months of 2017. Offense types include felony, misdemeanor, and violation, and misdemeanor is the most frequent offense type across all 5 boroughs. The second plot shows crime counts by offense type over the first 10 months of 2017. The results are similar and also indicate that misdemeanors are considerably more frequent than violations and felonies.
We then focused on the crime data for the current year.
We plot crime count versus hour of the day, grouped by borough. It shows that most crimes happened between 15:00 and 20:00.
nyc_crime_2017 %>%
  mutate(hour = hour(time)) %>%
  group_by(hour, boro) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = hour, y = n, color = boro)) + geom_point(alpha = 0.5) + geom_line()
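The hourly grouping above boils down to extracting the hour from each complaint time and tabulating the result. A small base R sketch on made-up times (the `times` vector is hypothetical and not taken from the dataset):

```r
# Hypothetical complaint times in "HH:MM:SS" form.
times <- c("15:30:00", "16:05:00", "16:45:00", "02:10:00", "19:59:00", "16:00:00")

# Take the first two characters as the hour of the day.
hours <- as.integer(substr(times, 1, 2))

# Count complaints per hour and find the busiest hour.
hour_counts  <- table(hours)
busiest_hour <- as.integer(names(hour_counts)[which.max(hour_counts)])
busiest_hour
```

The same idea, applied per borough, gives the lines in the plot.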
Next, we plot the crime counts and the crime rates for each month:
crime_tidy = nyc_crime_2017 %>%
  mutate(month = month(date)) %>%
  group_by(month, boro) %>%
  summarize(crime_count = n())
crimetotal = ggplot(crime_tidy, aes(x = month, y = crime_count, color = boro)) +
  geom_point() + geom_path(aes(group = boro)) +
  theme(legend.position = "bottom")
# attach each borough's population, then compute crimes per 100,000 residents
crime_rate = crime_tidy %>%
  mutate(population = recode(boro, "BRONX" = 1455720,
                             "BROOKLYN" = 2629150,
                             "MANHATTAN" = 1643734,
                             "QUEENS" = 2333054,
                             "STATEN ISLAND" = 476015)) %>%
  mutate(crime_rate = (crime_count/population)*100000)
crimerate = ggplot(crime_rate, aes(x = month, y = crime_rate, color = boro)) +
  geom_point() + geom_path(aes(group = boro)) +
  theme(legend.position = "bottom")
library(ggpubr)
# show the two plots side by side with a shared legend
ggarrange(crimetotal, crimerate, ncol = 2, common.legend = TRUE)
Here, we take a closer look at the crime counts and crime rates for each month of this year. To calculate the crime rate, we need population data for NYC, which we obtained from http://www1.nyc.gov/site/planning/data-maps/nyc-population/current-future-populations.page. The results show that Brooklyn has the highest number of crimes this year, but the Bronx has the highest crime rate. Queens is relatively safe. We also find that there are fewer crimes in February, probably because February is usually the coldest month, which makes criminals less willing to be out on the street (February also has fewer days, which lowers its monthly total).
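To make the difference between crime counts and crime rates concrete, here is a minimal sketch of the per-100,000 calculation. The populations are the figures used in the analysis; the crime counts are made up for illustration only:

```r
# Borough populations as used in the analysis.
population <- c(
  "BRONX" = 1455720,
  "BROOKLYN" = 2629150,
  "MANHATTAN" = 1643734,
  "QUEENS" = 2333054,
  "STATEN ISLAND" = 476015
)

# Hypothetical monthly crime counts, for illustration only.
crime_count <- c(
  "BRONX" = 8000,
  "BROOKLYN" = 11000,
  "MANHATTAN" = 8500,
  "QUEENS" = 7000,
  "STATEN ISLAND" = 1500
)

# Crimes per 100,000 residents.
rate_per_100k <- crime_count / population[names(crime_count)] * 100000
round(rate_per_100k, 1)

# The borough with the most crimes need not have the highest rate.
most_crimes  <- names(which.max(crime_count))
highest_rate <- names(which.max(rate_per_100k))
```

In this toy example the largest borough has the most crimes, while a smaller borough has the highest rate, which is the same pattern the two plots show for Brooklyn and the Bronx.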
Comments
* In Figure a, we present the top five places where crimes usually happen across the five boroughs of NYC. It shows that STREET and RESIDENCE are the least safe places, so we then look further into the major crime types in these two places for each borough.
* We read the crime counts and types from the width and partitioning of the bars. It is clear that assault, harassment, and criminal mischief are much more prevalent than other crimes in most boroughs.
* In residences, harassment and assault are more prevalent, while petit larceny and criminal mischief make up a larger share of crimes on the street.
* Next, we compare the major crime types in different boroughs. From Figures b-f, we find that the distribution of crimes is similar among Manhattan, Brooklyn, Staten Island, and Queens. However, the character of crime in the Bronx appears to be more complex. Specifically, crimes in the Bronx are usually more serious than in the other four boroughs, with the occurrence of FELONY ASSAULT and DANGEROUS DRUGS, which are felonies.
* Overall, the safety level in Manhattan, Staten Island, and Queens is relatively higher.
income = read_csv("./NYC_Income_by_Borough.csv") %>%
  clean_names() %>%
  mutate(boro = borough) %>%
  select(-borough)
crime_income = left_join(income, nyc_crime_population, by = "boro")
crime_income %>%
  ggplot(aes(x = income, y = crime_rate, color = income)) + geom_point(alpha = 0.5) + geom_smooth() +
  labs(title = "Correlation between family median income and crime rate in each borough",
       x = "Income Range",
       y = "Crime rate")
In addition, we are interested in finding potential factors that may be associated with the crime rate. Here we choose household income level. After reading the data from the web, cleaning it, and visualizing it, we were surprised by the scatter plot: both the lower-income and the higher-income boroughs have a high crime rate. For example, the family median income in the Bronx is 35176 dollars, associated with a crime rate of 0.029; that is, we expect 29 crime cases per 1000 people. In contrast, boroughs with a family median income between 60000 and 70000 dollars tend to have the lowest crime rate. Taking Queens as an example, we expect only 15 crime cases per 1000 people.
library(tidytext)
crime_words = nyc_crime_2017 %>%
  select(-longitude, -latitude) %>%
  mutate(ofns_desc = str_to_lower(ofns_desc),
         ofns_desc = str_replace(ofns_desc, "[2-3]", ""),
         ofns_desc = as.character(ofns_desc)) %>%
  unnest_tokens(word, ofns_desc)
data(stop_words)
crime_word_tidy =
  anti_join(crime_words, stop_words)
crime_word_tidy %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = fct_reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_bar(stat = "identity", fill = "blue", alpha = .6) +
  coord_flip()
The graph shows the top 10 words appearing in the offense descriptions. The most frequent one is larceny, which appears nearly 100000 times. Other frequent words include related, petit, assault, harrassment, etc. Most of them indicate the type of crime, which is consistent with what we expect.
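The word counts above come from splitting each offense description into words, dropping stop words, and counting what remains. A minimal base R sketch of the same idea, used here as a stand-in for the tidytext pipeline (the descriptions and the short stop-word list are made up for illustration; the analysis itself uses the stop_words data from tidytext):

```r
# Hypothetical offense descriptions in the style of the dataset.
ofns_desc <- c("PETIT LARCENY", "GRAND LARCENY", "ASSAULT 3 & RELATED OFFENSES",
               "PETIT LARCENY", "OFFENSES AGAINST THE PERSON")

# A tiny hand-picked stop-word list, for illustration only.
stop_list <- c("the", "against", "and")

# Lower-case, drop digits, split on non-letters, and remove stop words.
cleaned <- gsub("[0-9]", "", tolower(ofns_desc))
words   <- unlist(strsplit(cleaned, "[^a-z]+"))
words   <- words[words != "" & !(words %in% stop_list)]

# Most frequent words first.
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)
```

As in the real data, the crime-type words dominate once filler words are removed.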
word_ratios = crime_word_tidy %>%
  filter(ofns_type %in% c("VIOLATION", "FELONY")) %>%
  count(word, ofns_type) %>%
  group_by(word) %>%
  filter(sum(n) >= 5) %>%
  ungroup() %>%
  spread(ofns_type, n, fill = 0) %>%
  mutate(
    violation_odds = (VIOLATION + 1) / (sum(VIOLATION) + 1),
    felony_odds = (FELONY + 1) / (sum(FELONY) + 1),
    log_OR = log(felony_odds / violation_odds)
  ) %>%
  arrange(desc(log_OR))
word_ratios %>%
  mutate(pos_log_OR = ifelse(log_OR > 0, "felony_odds > violation_odds", "violation_odds > felony_odds")) %>%
  group_by(pos_log_OR) %>%
  top_n(10, abs(log_OR)) %>%
  ungroup() %>%
  mutate(word = fct_reorder(word, log_OR)) %>%
  ggplot(aes(word, log_OR, fill = pos_log_OR)) +
  geom_col() +
  coord_flip() +
  ylab("log odds ratio (felony_odds/violation_odds)") +
  scale_fill_discrete(name = "") +
  theme(legend.position = "bottom")
The chart above compares distinctive words (that is, words that appear much more frequently in one group than in the other) in the offense descriptions of violations and felonies. We can see that larceny, robbery, burglary, etc. appear more frequently in the offense descriptions of felony crimes, while harrassment, gambling, and loitering appear more frequently in the offense descriptions of violation crimes. These results give a basic picture of the difference between felonies and violations.
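To show how the log odds ratio separates the two groups, here is a small sketch that applies the same formula as the code above to made-up counts (the words are taken from the discussion, but the numbers are hypothetical):

```r
# Hypothetical word counts by offense type, for illustration only.
counts <- data.frame(
  word = c("robbery", "burglary", "harrassment", "loitering"),
  FELONY = c(40, 30, 0, 1),
  VIOLATION = c(0, 1, 50, 9)
)

# Add 1 to each count so words absent from one group do not give log(0).
felony_odds    <- (counts$FELONY + 1) / (sum(counts$FELONY) + 1)
violation_odds <- (counts$VIOLATION + 1) / (sum(counts$VIOLATION) + 1)

# Positive values lean felony, negative values lean violation.
counts$log_OR <- log(felony_odds / violation_odds)
counts[order(-counts$log_OR), c("word", "log_OR")]
```

Words with a large positive value sit at the felony end of the chart, and words with a large negative value sit at the violation end.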